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Abstract 



A projection-based immersed boundary method is dominated by sparse linear algebra routines. Using the open- source Cusp library, 
we observe a speedup (with respect to a single cpu core) which reflects the constraints of a bandwidth-dominated problem on the 
gpu. Nevertheless, gpus offer the capacity to solve large problems on commodity hardware. This work includes validation and a 
convergence study of the gpu- accelerated ibm, and various optimizations. 

Keywords: Immersed Boundary Method, Computational Fluid Dynamics, GPU Computing 



o 

<N 

Oh 
0) 



u 

o 



> 

m 
o 



1. Introduction 

The immersed boundary method (ibm) refers to a general 
class of techniques in computational fluid dynamics, character- 
ized by the solution of the incompressible Navier-Stokes equa- 
tions on a grid that does not conform to the immersed solid 
body present in the fluid. Conventional CFD techniques require 
the generation of a mesh that conforms to the geometry of the 
domain on which the equations are solved. In the immersed 
boundary method, the fluid is represented by an Eulerian grid 
(typically a Cartesian grid) and the solid boundary is repre- 
sented by a collection of Lagrangian points. This has several 
advantages. Mesh generation is trivial, and simulations involv- 
ing moving solid bodies and boundaries are made simpler. The 
Navier-Stokes equations are solved on the entire grid (including 
points within the solid), and the effect of the solid body is mod- 
elled by adding a singular force distribution f along the solid 
boundary which enforces the no-slip condition. The governing 
equations are, 



du 

~dt 



+ u • Vu = -Vp + vAu 

+ J ms,t))s(t-x)ds 

V-u = 

u(£(s,t)) = J u(x)6(x-g)dx 



(la) 
(lb) 

(lc) 



where u B is the velocity of the body at the boundary point lo- 
cations. The different ibm formulations use different techniques 
to calculate this forcing term. 

The immersed boundary method was introduced in 1972 by 
C S Peskin [21] to model blood flow through the elastic mem- 
branes of the heart. The technique has since been extended 



to simulate rigid bodies, and the ibm has seen renewed inter- 
est in recent times. Mittal and Iaccarino [17] provide a good 
overview, discussing various techniques that have been used to 
impose the no- slip condition on immersed boundaries. 

Peskin [21] proposed modelling the body as a collection of 
springs, with boundary points placed at the equilibrium posi- 
tions of the springs and allowed to move with the flow, and the 
force calculated using Hooke's law. The force is transferred 
from the solid to the fluid grid by using a well-chosen dis- 
cretized approximation to the Dirac delta function. A second- 
order accurate extension to this method was developed in [12]. 

Goldstein et ah [6] generalised this technique by using a 
singular force at the boundary points, given by F(x s ,t) = 
a J (u(x s , t)-\(x S9 t))dt+/3(u(x s , t)-\(x s , t)), where a and J3 are 
constants that are chosen depending on the physical characteris- 
tics of the flow. This forcing function enforces the no-slip con- 
dition at the solid surface by means of a feedback loop. Saiki 
and Biringen [23] used this technique in conjunction with the 
finite difference method. Note that we obtain the same method 
as [12] by setting /3 to zero. 

These methods are easy to implement but have some short- 
comings. The no-slip condition is not directly applied, and it 
takes some time for the system to attain it. The algorithms also 
depend on ad hoc parameters that need to be chosen carefully 
depending on the physics of the flow and the desired accuracy. 
The typically large values of the parameters also place a very 
strict restriction on the time step due to stability constraints. 

Mohd-Yusof [19] formulated a method in which the forc- 
ing term is directly obtained from the discretised Navier-Stokes 
equations by assuming no-slip at the solid boundaries. The ve- 
locity at the grid points near the boundary is estimated via in- 
terpolation and the forcing term is calculated by 



f = + u • Vu) + Vp - /./Au. 



(2) 
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This technique was implemented by Fadlun et al. [4]. An im- 
plicit scheme to calculate the diffusion and the forcing term al- 
lowed large CFL numbers of the order of 10" 1 to be used. This 
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method does not exactly satisfy the no-slip condition, but the 
errors were shown to be small. Kim et al. [10] used a similar 
idea, with the addition of a mass source at the boundaries. The 
forcing term and mass sources were calculated explicitly, which 
placed a stricter restriction on the time step. 

Uhlmann [29] reported that the method in [4] produced a 
boundary force that was not smooth in time when it was used 
to solve flows with moving boundaries. To fix this, he recom- 
mended calculating the singular force distribution at the La- 
grangian body points and then distributing the force on the Eu- 
lerian grid by using a discrete delta function, rather than cal- 
culating the forcing term directly. This method too does not 
satisfy the no-slip condition exactly and has the stricter time 
step restriction, but eliminated the force oscillations. 

The immersed boundary projection method was introduced 
by Taira and Colonius [24], based on an extension of the projec- 
tion method [2, 9, 20] used to solve the incompressible Navier- 
Stokes equations. Using the idea that the pressure in incom- 
pressible flows acts as a Lagrangian multiplier to ensure the 
zero-divergence condition, they similarly consider the forcing 
term as Lagrangian multiplier that ensures the no- slip condi- 
tion on the immersed boundary. During the projection step, the 
velocity field is adjusted such that both these constraints are 
satisfied exactly. An implicit scheme used for diffusion allows 
the usage of relatively large time steps, with CFL numbers as 
high as 1 . They later used a nullspace approach that sped up the 
simulation by up to two orders of magnitude [3]. This approach 
essentially uses a spectral method to solve the equations cast in 
the velocity- vorticity formulation. 

The techniques described so far make use of some kind of in- 
terpolation to transfer the singular forces at the boundary to the 
Eulerian grid on which the Navier-Stokes equations are solved. 
This causes the boundary to appear diffused and affects the ac- 
curacy of the boundary layer. Boundary layers need to be re- 
solved more accurately at higher Reynolds numbers, and so 
a number of sharp -interface methods have been developed to 
handle this problem. Among them is the cut-cell method. Cells 
near the boundary are cut to shapes that conform to the body, 
and the rest of the grid is Cartesian. The Navier-Stokes equa- 
tions are solved using the finite volume method in all the cells 
that contain the fluid and no forcing term is used. Both local and 
global conservation of properties in the domain can be ensured. 
This technique has been used by Ye et ah [32], Udaykumar et 
al [26, 27, 28], Mittal et al [16, 18] and Marella et al [15]. 
Extending this procedure to three dimensions is non-trivial due 
to the complex shapes that can be formed by cutting. 

Another sharp-interface method is known as the ghost-cell 
method. Here, an interpolation scheme is used at the interface 
to calculate flow properties at the grid points, implicitly taking 
into account the boundary conditions. This technique has been 
used by Majumdar et al. [14], Iaccarino and Verzicco [7], Ghias 
et al. [5] and Kalitzin et al. [8], the last two of whom carried 
out turbulence simulations. 

In the present work, we use the algorithm presented in [24] 
for the solution of two-dimensional incompressible viscous 
flows with immersed boundaries, which is explained in detail 
in §2. We have produced an implementation on gpu hard- 



ware, studying techniques that provide good performance in 
this multi-threaded architecture, which are described in some 
detail. The various challenges that an efficient implementa- 
tion on gpu pose are representative of those that practitioners 
of other CFD methods would face. Thus, we hope to contribute 
to the ongoing investigations on the use of gpu hardware for 
computational fluid dynamics. To our knowledge, the ibm has 
not previously been implemented on the gpu. The perspective 
of doing so is the capacity of solving large three-dimensional 
moving boundary problems on commodity hardware. 

2. Immersed Boundary Projection Method 

We implement the method proposed by Taira & Colonius 
[24]. The Navier-Stokes equations (la)-(lc) are discretised on 
a staggered cartesian grid. The grid can be non-uniform, and it 
is advantageous to stretch the grid away from the high vorticity 
regions to reduce the number of grid points without sacrificing 
accuracy. We obtain the following set of algebraic equations: 



Au n+l - f n = -G(f) + bci + Hf 



Du n+l = bc 2 
Eu n+l = u n B + \ 



(3) 



which can be written as: 
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where and / are vectors whose contents are the pressure and 
the components of the forces at the immersed boundary points 
respectively. The velocity at the current time step u n is known. 
be i and bc 2 are obtained from the boundary conditions on the 
velocity. 

H and E are the regularisation and interpolation matrices re- 
spectively, which are used to transfer values of the flow vari- 
ables between the Eulerian and Lagrangian grids. The interpo- 
lation in (lc) can be discretised as: 



u k 



= ^ Uid{xi - % k )d(yi - 7j k )AxAy, 



(5) 



where is the velocity at a point (^k^k) on the immersed 
boundary. Uk is calculated by convolving the velocities on 
the Eulerian grid U{ at locations feyO with a discrete two- 
dimensional delta function. The discrete delta function is 
a product of smoothed one-dimensional delta functions <4(r) 
along each Cartesian direction. We choose the one used by 
Roma et al [22] : 



d h (r) ■ 



'^(5-3£-V-3(l-£) 2 + l), 0.5 < 2*1.5 



1 + 

otherwise, 



< ^<0.5 



where h is the cell width. Use of the discrete delta function re- 
quires the grid to be uniform near the immersed boundary. Suf- 
ficient number of boundary points need to be chosen to prevent 
flow leakage, and the spacing between boundary points should 
be around the same as the cell width of the Eulerian grid. 
From the above, we can calculate the elements of matrix E: 



AxAyd(Xi - £k)d(yi - r] k ). 



(6) 



The matrix H is similarly determined by discretising the forc- 
ing term in (la). 

The explicit second-order Adams-Bashforth scheme is used 
to discretise the convection terms and the Crank-Nicolson 
scheme is used for diffusion. All spatial derivatives are cal- 
culated using central differences. Note that no explicit pres- 
sure boundary conditions need to be specified. The pressure 
and body forces are calculated implicitly. The above system 
of equations can be solved to obtain the velocity field at time 
step n + 1, the pressure (to a constant) and the body forces. But 
the left-hand side matrix is indefinite, and solving the system 
directly is ill-advised. 

By performing appropriate transformations (see Appendix of 
[24] for details), we can show that the above system is equiva- 
lent to: 



(7) 



where q n+l is the momentum flux at each cell boundary, and 
the sub-matrices are transformed versions of those in (4). We 
can rewrite the above by combining some of the sub-matrices 
in the following manner: 
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which gives us 



A 
Q T 



Q 





(8) 



(9) 



Applying the approximate factorisation described in [20] to 
the above system, we obtain the following set of equations 
which can be solved to give us the velocity distribution at time 
step n + 1 : 



Aq = n 
Q T B N QA = Q T q* - r 2 
q n+l =q*- B N QA 



(10a) 
(10b) 
(10c) 



where B N is an N th order approximation of A~ l . This must be 
taken into account while estimating the overall time accuracy of 
the method. 

This factorisation is very advantageous as the two linear sys- 
tems (10a) and (10b) that we now need to solve can be made 



positive definite, and can be solved efficiently using the conju- 
gate gradient method. In the absence of an immersed boundary, 
this set of equations is the same as that solved in the traditional 
fractional step method or projection method [20, 9, 2]. The 
final equation (10c) simultaneously ensures a divergence-free 
velocity field and the no- slip condition in the next time step. 
By virtue of the above formulation, no special boundary con- 
ditions need to be derived for q* or A. As is usual, one value 
of pressure needs to be pinned to obtain a unique solution since 
the left hand side matrix Q T B N Q in (10c) has one eigenvalue 
which is zero. 

3. Assessment of iterative solvers for the ibm 

Like a majority of numerical methods used in computational 
mechanics, the ibm requires tools for sparse linear algebra. The 
matrices A, Q and Q T are sparse, the vectors q n+l and A are 
dense, and we need to operate on these objects via matrix- 
vector, matrix-matrix, and even a triple-matrix multiplication. 
To take advantage of the gpu, we need efficient means of both 
representing and operating on these matrices and vectors on the 
device. Currently, there are two tools available for this pur- 
pose: cusparse, part of nvidia's cuda, or the external library, 
Cusp. The Cusp library is being developed by several nvidia 
employees with minimal software dependencies and released 
freely under an open-source license. We chose to use the Cusp 
library for several reasons: it is actively developed and separate 
from the main cuda distribution, allowing for faster addition 
of new features (such as new pre-conditioners, solvers, etc.); 
and, all objects/methods from the library are usable on both cpu 
and gpu. This allows us the flexibility to, for example, perform 
branching-heavy code on the cpu, before trivially transferring 
to the device and running (for instance) a linear solve, where it 
will be significantly faster. It also allows us to maintain both a 
cpu and gpu code with less effort. 

To illustrate the importance of having efficient linear algebra 
methods, we show in Figure la a breakdown of the timings from 
an example run on the gpu the with the ibm (4000 time steps of 
a flapping airfoil at Re = 75). The mesh consists of 930 x 654 
cells, resulting in systems of over 600, 000 unknowns. Even 
in this moderately sized test, the runtime is dominated by the 
solution of the coupled linear system for pressure and forcing 
terms, denoted by 'Solve 2'. Speeding up this linear solve is the 
major motivation for using the gpu. 

At this point, it is interesting to consider the potential bene- 
fits of using the gpu for our linear systems. Sparse linear sys- 
tems are a well known example of a bandwidth-limited prob- 
lem — the potential speedup from using a gpu will at best be 
the ratio of the bandwidths of the cpu and gpu used. For the 
Intel Xeon X5650 processors in our workstation, Intel quotes 
a maximum bandwidth of 32 GB/s 1 , while nvidia quotes 144 
GB/s for the Tesla C2050 cards we use 2 . Using these num- 
bers, we can see that the best-case speedup should be 4.5 x for 



1 http://ark.intel.com/products/47922/Intel-Xeon-Processor-X5650-(12M- 
Cache-2_66-GHz-6_40-GTs-Intel-QPI) 

2 http://www.nvidia.com/docs/IO/43395/NV JDS_Tesla_C2050_C2070_jullO_lores.pdf 
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Figure 1 : (a) Timing breakdown for a flapping airfoil at Re = 75 using the gpu code, (b) Comparison of time taken to solve a system of linear 
equations Ax = b on the cpu and gpu. A is chosen as the standard 5-pt Poisson stencil. 



a purely bandwidth-bound problem, while for computationally 
bound problems the speedup can be much higher. In practice, 
even the worst case speedup is definitely worthwhile, certainly 
enough to justify the extra complexity involved in using the gpu. 

Figure lb shows a timing comparison between the cpu and 
gpu using Cusp's conjugate gradient solver. The system be- 
ing solved in this case is given by a traditional 5 -point Poisson 
stencil, which while not directly used in the ibm code, gives a 
good measure of relative performance that can potentially be 
obtained. The plot shows the wall-clock time required to solve 
to a relative accuracy of 10" 5 for numbers of unknowns rang- 
ing from 2500 to 4 x 10 6 . For large systems, the gpu solve is 
significantly faster, with a speedup of 8x for the largest system 
shown. This indicates that even though our actual system may 
be much harder to solve, the potential speedup is significant. 

For the particular linear system we're interested in, we also 
have to take into account that it has an unusual non-zero struc- 
ture, shown in Figure 2, due to the coupled nature of the vari- 
ables being solved for (pressure and forcing terms). This cou- 
pled nature leads to conflicting approaches being optimal for 
different sections of the matrix. The pressure terms are a 
standard 5-point Poisson stencil (to be extended to 7-point in 
3D), and as such would be a good candidate for a hierarchical 
method using a smoother, such as algebraic multigrid (AMG). 
However, the forcing terms and coupling sections of the matrix 
would be unsuitable for this kind of approach. Instead, a stan- 
dard solver such as CG (Conjugate Gradient) would be better 
suited. 

To tackle this difficulty, we can choose one of two possible 
approaches: (i) use a suitable pre-conditioner for the system to 
make it better suited for solution with a standard iterative solver, 
most likely CG, or ( ii) investigate other forms of solvers, such 
as AMG. While option ( ii) could potentially have the greatest 




Figure 2: Example of sparsity pattern for Q T B N Q matrix, showing 
the coupling terms in the right and lower sections of the matrix 



effect on both convergence rate and total time to solution, our 
choices are somewhat limited, given the tools we are using. Ei- 
ther we rely on the smoothed aggregation AMG that is part of 
Cusp, or we would need to develop our own from scratch — at 
this time there are no other GPU-enabled AMG codes available 
for public use. 

To provide further insight into our options, we can test repre- 
sentative matrices from different applications of our code (with 
differing complexities of immersed boundaries) and compare a 
standard CG solver against the form of AMG available in Cusp, 
an aggregative AMG. The CG solver will be used both without 
a preconditioner, as well as using the diagonal and smoothed- 
aggregation preconditioner s from Cusp. For each problem, we 
show the wall-clock time taken to solve the system to a rela- 
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tive accuracy of 10" 5 , the same accuracy used in previous tests. 
A summary of the matrices produced from each system can be 
seen in Table 1, and the timing results are shown in Figure 3. 

From these results, it can be plainly seen that PCG with 
smoothed-aggregation AMG as the preconditioner is the fastest 
on all tests, and so this is the combination of solver and precon- 
ditioner used in all our subsequent runs. 

Our choice of tools allows us to easily perform all sparse 
linear algebra operations on the gpu. On the other hand, there 
are parts of the algorithm that cannot easily be expressed us- 
ing linear algebra, such as generating the convection term using 
a finite-difference stencil and applying boundary conditions to 
the velocities (which involves modifying select values of ap- 
propriate arrays). One possible way of performing these ac- 
tions is to transfer data from the gpu, perform the calculations 
on the cpu and transfer the modified vector(s) back to the gpu 
every time step — but this incurs a prohibitively high cost in 
memory transfers. The alternative, which we have done, is to 
use custom- written cuda kernels utilizing all appropriate tech- 
niques, including the use of shared memory, to perform these 
operations on the gpu. This requires access to the underlying 
data from the Cusp data structures, which can be done easily 
using the Thrust library, on which Cusp was built. 

4. Strategy for gpu implementation 

Here we discuss some important implementation details that 
are used to decrease the total runtime, or to decrease the amount 
of memory needed, or both. 

• Keep everything on the gpu: To reduce costly memory 
transfers, everything that can be performed on the gpu is, 
regardless of its suitability. For instance, the code to en- 
force boundary conditions operates on very few values, 
has almost no computational complexity and is highly di- 
vergent based on the boundary conditions chosen (i.e., 
Dirichlet / Neumann, etc.). Thus, it will always perform 
rather badly on the device, giving essentially no speedup. 
However, it would take significantly longer to transfer the 
necessary data back to the host and perform the operation 
there. This means taking the time to write and test this 
kernel gives us an overall speedup, due to this elimination 
of transfers. This is only one example, but all but one el- 
ement of the total algorithm has been transferred, yielding 
significant savings in the total run time. The final element 
still on the cpu involves updating the forcing terms in g, 
and as such involves a very small number of entries com- 
pared to any other array. Thus, this transfer ends up being 
an inconsequential part of the total algorithm. 

• Triple Matrix Product: To generate the left-hand side of 
equation (10b), we need to multiply 3 sparse matrices, Q T , 
B N and Q. To do this within the current Cusp frame- 
work, we would need to perform two separate matrix- 
matrix multiplies, the first for Q T B N and then the result 
of this multiplied by Q. This method involves creating 



a temporary matrix in memory, and for higher-order ap- 
proximations of B N , say, 3 rJ -order, this temporary can be 
quite large, in the order of 700MB. Cusp performs matrix- 
matrix multiplies in the following form (using Matlab no- 
tation, where A [range , range] denotes a sub matrix of 
A, and the ' : ' symbol denotes all columns/rows depend- 
ing on context; thus, A [slice , :] refers to all columns in 
a range of rows denoted by slice) 
Require: Input sparse matrices A, B, result matrix C 

Split A into slices A [slice , :] s.t. each A [slice, :] -B 

fits in device memory 

for V slices do 

slice <— A [slice , : ] • B 
add slice to list of slices 

end for 

Assemble C from all computed slices 

The multiplication of A[slice,:] • B is here be- 
ing performed by a helper function within Cusp 
(cusp: : detail: : device: : spmm_coo_helper). 

We propose a routine to perform a triple matrix product 
of the form D <— A • B • C by using this helper function 
repeatedly to ensure the full intermediate product need not 
be calculated. We do this by realizing the following: 

temp_slice = A[slice,:]-B 
D [slice,:] = temp_slice • C. 

Thus, we can form slices of the final result while only stor- 
ing a slice as an intermediate by applying the helper func- 
tion twice in succession. This alleviates the need to create 
and store the full intermediate matrix. While at first glance 
the intermediate, B N Q might be computed once and used 
in both (10b) and (10c), we explain below how this term is 
unnecessary in (10c). 

• Use optimized routines where available: In equation 
(10c), we have to calculate the product B N QA, where B N 
and Q are sparse matrices, and A is a dense vector. If 
we implement this naively, we perform the matrix-matrix 
product of B N Q then multiply this with the vector A. How- 
ever, the Sparse matrix- vector product (SpMV) in Cusp 
has been optimized to a significant state, as demonstrated 
by the co-authors of Cusp[l]. Therefore, we prefer to im- 
plement this operation as a pair of SpMV operations, first 
QA resulting in a vector, then the result of this with B N , 
exchanging the matrix-matrix product with a SpMV. This 
also reduces the amount of memory needed, as we have no 
need to store the extra matrix obtained by B N Q. 

• Re-use the hierarchy generated by the smoothed- 
aggregation preconditioner: The most expensive part of 
our second solve, Equation (10b) for moving bodies, is the 
generation of the hierarchy for the smoothed aggregation 
preconditioner. When bodies are stationary, we have no 
need to recompute this at each time step, and can use the 
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Figure 3: Comparison of gpu solvers on real test problems - Conjugate Gradient, Preconditioned Conjugate Gradient with diagonal and Smoothed 
Aggregation preconditioners, and a full solve using Smoothed Aggregation 



System # 


Domain Size 


# Bdy Points 


# Unknowns 


# Non-zeros 


1 


330x330 


158 


109216 


554752 


2 


986 x 986 


786 


973768 


4914520 


3 


930 x 654 


101 


608422 


3045406 


4 


690 x 690 


474 


477048 


2412448 



Table 1 : Systems 1 and 2 correspond to flow over a cylinder at Re = 40 and 3000 respectively, system 3 is for a flapping- wing calculation, and 
system 4 is from a synthetic test with 3 moving cylinders. 
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same hierarchy for the entire simulation, effectively amor- 
tising the high cost over the total run. When we have 
moving bodies, this hierarchy should be recalculated ev- 
ery time step. We propose to instead re-use the hierarchy 
for several time steps. To do so, we need to investigate the 
balance of not performing the expensive calculation every 
time step, but potentially having a lower convergence rate 
for the steps where the hierarchy is re-used. For a flapping 
airfoil, Figure 4 shows the effect on the total run time of 
recalculating the preconditioner every n time steps, clearly 
showing that in order to reduce run time for this problem, 
we should recalculate the preconditioner every 2 or 3 time 
steps. 



1.25 



x 10 




4 6 
Time steps 

Figure 4: Timings for a flapping airfoil run, re-calculating the 
Smoothed Aggregation precondtioner every number of time steps. 



5. Validation 

5.7. Couette flow between concentric cylinders 

As a validation test, we calculate the flow between two con- 
centric cylinders of radius r t = 0.5 and r Q = 1 centered at the 
origin. The outer cylinder is held stationary while the inner 
cylinder is impulsively rotated from rest with an angular veloc- 
ity of £1 = 0.5. The cylinders are contained in a square station- 
ary box of side 1.5 centered at the origin. The fluid in the entire 
domain is initially at rest and the calculations were carried out 
for kinematic viscosity v = 0.03. 

The steady-state analytical solution for this flow is known. 
The velocity distribution in the interior of the inner cylinder is 
the same as for solid body rotation and the azimuthal velocity 
between the two cylinders is given by: 



u e (r) = Qr, 



(r /r - r/r ) 

(xoln-nlro)' 



(ID 



We compared this to the numerical solution for six different 
grid sizes ranging from 75 x 75 to 450 x 450. Table 2 shows 
the L 2 and L°° norms of the relative errors and Figure 5b shows 
that the scheme is first-order accurate in space, as expected for 
the ibm formulation we used. 





Order of convergence 


Order of convergence 


Time 


(N=l) 


(N = 3) 


0.8 


0.97 


2.67 


2 


0.99 


2.85 


4 


0.93 


2.73 


8 


0.97 


2.83 



Table 2: Calculated order of convergence at different times for 
Couette-flow validation. 



To verify the temporal order of convergence, we ran a simu- 
lation from t - to f = 8onal51xl51 grid, using different 
time steps (At = 0.01, 0.005 and 0.0025). Both first- and third- 
order accurate expansions of B N were used and the calculated 
orders of convergence (using the L 2 norms of the differences in 
the solutions) at various times have been summarised in Table 
2, and are as expected. 

5.2. Flow over an impulsively started cylinder 

We also carried out computations to simulate flow over an 
impulsively started circular cylinder at Reynolds numbers 40, 
550 and 3000. The cylinder is of diameter d = 1 centered at the 
origin and is placed in an external flow with freestream velocity 
Uoo = 1 . The domain considered was square with each side of 
length 30, centered at the origin. The fluid flows from left to 
right, and the velocity on the left, top and bottom edges was set 
to be the freestream velocity. A convective boundary condition 



(%+u 



— - 0) was used on the right edge. The initial velocity 



dt 1 1/1/00 dx 

is uniform throughout the domain. The minimum cell widths 
used near the solid boundaries were 0.02, 0.01 and 0.004 re- 
spectively for the three cases. Away from the body, the grid is 
an exponential stretched grid. More information about the grids 
is given in Table 3. 

The flow at ReAO reaches a steady state and the drag coeffi- 
cient was found to be 1.57, which matches the expected value 
[25]. The vorticity fields and the unsteady drag coefficients for 
the cases with Re550 and 3000 also agree well with past com- 
putations [11]. The results are presented in Figures 6-8. 



Re 


n x X rty 


Ax ■ 


Extent of uniform grid 


V stretching 


40 


330x330 


0.02 


[-0.54,0.54] 


1.02 


550 


450 x 450 


0.01 


[-0.54,0.54] 


1.02 


3000 


986 x 986 


0.004 


[-0.52,0.52] 


1.01 



Table 3: Grid information for the impulsively started cylinder cases. 



5.3. External flow over a circular cylinder 

Longer runs were performed to simulate the von Karman 
street behind a circular cylinder. Table 4 lists the results that 
were obtained for flows with different Reynolds numbers. 

The cylinder is again of diameter 1 and in the center of a 
domain of size 30x30. The minimum cell width that was used 
for the runs was 0.02, with it being maintained for the whole 
distance behind the cylinder. The grid to the left, top and bottom 
of the cylinder is exponential with a stretching ratio of 1.02. 
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(a) (b) 
Figure 8: (a) Vorticity field after non-dimensional time 3.0 and (b) time varying drag coefficient for external flow over a circular cylinder at 
Reynolds number 3000. The contour lines in (a) are drawn from -56 to 56 in steps of 4, excluding the zero contour. 
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(b) 

Figure 5: (a) Comparison of the numerical solution on a 150 x 150 
grid with the analytical solution and (b) convergence study, showing 
errors for different grid sizes. 



The initial position of the cylinder is slightly offset in the y- 
direction and it is given a nudge to bring it to the center at the 
beginning of the run to trigger the instability in the flow and 
cause vortex shedding. 



6. Simulations of wake flows and moving boundaries 

The calculations presented above validate our code with both 
analytical results and experimental benchmarks, yet they are 
not useful to evaluate the performance of the ibm when applied 
to moving boundary flows. With moving boundaries, the ibm 
requires regenerating the matrix and the preconditioner at every 



Re 


Q 


c d 


St 


S t experimental [31] 


100 


+0.339 


1.37 + 0.009 


0.166 


0.164 


150 


+0.532 


1.35 + 0.026 


0.185 


0.184 


200 


+0.688 


1.36 + 0.042 


0.197 


0.197 




Table 4: Computed data for flow past a circular cylinder at different 
Reynolds numbers and comparison with experimental results 



Figure 9: Flow over a circular cylinder at Re 200. Vorticity contours 
from -5 to 5 in increments of 0.4 



time step (or every few time steps). For this reason, we include 
here some calculations that reproduce past results for flows with 
moving boundaries, for which we report the run times. All the 
runs were performed on an nvidia Tesla C2050 gpu. The first 
case is a heaving airfoil placed in a uniform flow (performed by 
Lewin and Haj -Hariri [13]) and the second is an airfoil that per- 
forms both pitching and heaving motions, simulating the flap- 
ping motion of an insect wing (performed by Wang et al [30]). 

6.1. Heaving Airfoil 

The simulation was carried out for an elliptic airfoil of 
thickness-to-chord ratio 0.12 and chord length c = 1, heaving 
with a reduced frequency k = 2.0 and non-dimensional maxi- 
mum heaving velocity kh = 0.8 at Reynolds number Re - 500. 
The domain is of size 30 x 30 and the near-body region in 
[-0.52, 0.52] x [-0.52, 0.52] has the minimum cell width of 
0.005. The region [0.52,0.78] immediately behind the airfoil 
has an exponential grid with stretching ratio 1.02, and a uni- 
form grid of size 0.01 follows from that region to the edge of 
the domain. The grid in front of the airfoil is stretched with a 
ratio of 1.02 and the grid in the v-direction above and below 
the body is stretched with a ratio of 1.015. The total size of 
the mesh is 1339 x 686 and the time step used is 0.0005. The 
boundary conditions are the same as those used for the earlier 
cases of external flow over a cylinder. The obtained vorticity 
field (see Figure 10) compares well with the results of Lewin 
and Haj -Hariri [13]. The linear solve to calculate the pressure 
and forces solves for over 900,000 unknowns and each time 
step of the simulation requires 5 seconds. 

6.2. Flapping airfoil 

We consider a flapping airfoil, the motion of which is de- 
scribed by: 

x(t) = — cos(2tt/Y) 
a(t) = a + J3 sin(27r/Y + 0), 

where x(t) is the position of the center of the airfoil and a(t) is 
the angle made by the airfoil with the line of oscillation. The 
airfoil is elliptical with a thickness-to-chord ratio of 0.12 and 
rotates about its center. The Reynolds number is calculated us- 
ing the maximum translational velocity of the airfoil and the 
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-1 -0.5 0.5 1 1.5 2 2.5 3 3.5 -1 -0.5 0.5 1 1.5 2 2.5 3 3.5 

(e) t* = 8.0625 (f) t* = 8.4375 

Figure 10: Vorticity field for the downstroke of a heaving airfoil in a flow with Re = 500, k = 2.0 and kh = 0.8. Vorticity contours are drawn at 
levels ±2, +6, ±10, etc. Compare with Figure 3 in [13]. 
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chord length. We consider the case with symmetrical rotation 
(0 = 0) at Reynolds number 75 and with Aq/c = 2.8, a = n/2, 
P = 7i I A and / = 0.25 Hz. 

The airfoil has chord length 1 and oscillates at the center of a 
domain, each side of which is 30 chord-lengths long. The grid 
is uniform in the region [-2, 2] x [-0.52, 0.52] with cell width 
0.01 and beyond this region it is stretched with a ratio of 1.01 
on all sides, resulting in a mesh size of 930 x 654 cells. The 
time step used is 0.001. The run time for 4000 time steps (one 
cycle) was 164 seconds, when the preconditioner was updated 
every two time steps. 

The vorticity field at different times during the first cycle is 
shown in Figure 1 1 and a comparison of the unsteady lift co- 
efficient with the computational and experimental results pre- 
sented by Wang et al [30] is plotted in Figure 12. The experi- 
ments were conducted with a three-dimensional wing and both 
simulations were performed in two dimensions, hence we don't 
expect them to closely match. But we note that the computa- 
tional results follow the expected trend and agree reasonably 
well with the experimental results. 



7. Conclusions and Future Work 



At this time, we have a validated gpu code for the projec- 
tion ibm, and we have shown convergence with the expected 
rates. Using the free and open-source Cusp and Thrust libraries 
to provide sparse linear algebra functionality, we are able to 
quickly run large experiments to test a wide variety of problems 
including internal Couette flows, external flows past cylinders 
and complex moving bodies. 

In this paper we have outlined a strategy for using the Cusp 
library effectively by reducing memory transfers between the 
host and device and where possible prioritizing the use of the 
most tuned routines. We save the limited memory available 
on the gpu by using a modified routine to calculate a neces- 
sary triple matrix product without needing a costly intermediate 
matrix. Furthermore, we have investigated the suitability of a 
variety of linear solvers for our particular problem, demonstrat- 
ing that for our problems a Conjugate Gradient method with a 
smoothed aggregation AMG preconditioner updated every 2 or 
3 timesteps is the best available for reducing the total runtime 
of the algorithm. 

We have demonstrated the capability of our code for both 
high Reynolds number flows and for flows with complex mov- 
ing boundaries, comparing very well with existing works. 

The next major step for this work, is to extend our algorithm 
into 3 dimensions — while this extension doesn't involve many 
differences to the equations, it will require a significant increase 
in memory usage. This will necessitate modifying the code to 
run in parallel on multiple gpus, a far more significant change. 
A further improvement would be to implement the ibm with an 
adaptive grid - this would reduce both the memory require- 
ments and the amount of work needed per times tep. 
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Figure 12: Unsteady lift coefficient for the first three cycles of a flapping airfoil (Re = 75, = 0, A /c = 2.8). The time is non-dimensionalised 
using the time period of oscillation and the lift coefficient is calculated by normalising the lift force with respect to the maximum of the quasi- 
steady force experienced by the airfoil considered (see [30] for more details) 
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